Building a Balanced k-d Tree with MapReduce
Abstract
The original description of the k-d tree recognized that rebalancing techniques, such as those used to build an AVL tree or a red-black tree, are not applicable to a k-d tree. Hence, to build a balanced k-d tree, it is necessary to obtain all of the data before building the tree and then to build the tree by recursively subdividing the data. One algorithm for building a balanced k-d tree finds the median of the data at each recursive subdivision. A new algorithm builds a balanced k-d tree by presorting the data in each of the k dimensions before building the tree, and then constructs the tree in a manner that preserves the order of the k presorts during recursive subdivision. This new algorithm is amenable to execution via MapReduce and permits building and searching a k-d tree that is represented as a distributed graph.
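The presort-then-partition idea described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation and it does not cover the MapReduce distribution or the distributed-graph representation. The names `Node`, `super_key`, `build_kd_tree`, and `height` are hypothetical, and the tie-breaking rule (comparing coordinates cyclically, starting at the splitting axis) and the removal of duplicate points are assumptions made here so that every split has a unique median.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def super_key(p: Point, axis: int) -> Point:
    # Assumed tie-break: compare coordinates cyclically, starting at `axis`.
    k = len(p)
    return tuple(p[(axis + i) % k] for i in range(k))


def build_kd_tree(points: Sequence[Point]) -> Optional[Node]:
    # Assumption: duplicates are dropped so that all super keys are distinct.
    unique = list(set(points))
    if not unique:
        return None
    k = len(unique[0])
    # Presort once per dimension, before any subdivision takes place.
    presorted = [
        sorted(unique, key=lambda p, a=a: super_key(p, a)) for a in range(k)
    ]
    return _build(presorted, 0)


def _build(lists: List[List[Point]], depth: int) -> Optional[Node]:
    axis = depth % len(lists)
    ordered = lists[axis]
    if not ordered:
        return None
    # The median is read directly from the list presorted on this axis.
    median = ordered[len(ordered) // 2]
    median_key = super_key(median, axis)
    # Split every presorted list around the median. Filtering in place
    # keeps the relative order, so no list is ever sorted again.
    lower = [[p for p in lst if super_key(p, axis) < median_key] for lst in lists]
    upper = [[p for p in lst if super_key(p, axis) > median_key] for lst in lists]
    return Node(median, _build(lower, depth + 1), _build(upper, depth + 1))


def height(node: Optional[Node]) -> int:
    # Number of nodes on the longest root-to-leaf path.
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


if __name__ == "__main__":
    sample = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (5, 4)]
    tree = build_kd_tree(sample)
    print(tree.point, height(tree))
```

Because each subdivision takes its median from an already sorted list and only filters the remaining lists, the sorting cost is paid once up front; the sketch trades some efficiency for brevity by copying lists at every level.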
Similar resources
Building a Balanced k-d Tree in O(kn log n) Time
The original description of the k-d tree recognized that rebalancing techniques, such as are used to build an AVL tree or a red-black tree, are not applicable to a k-d tree. Hence, in order to build a balanced k-d tree, it is necessary to find the median of the data for each recursive subdivision of those data. The sort or selection that is used to find the median for each subdivision strong...
On Integer Sequences Derived from Balanced k-ary trees
This article investigates numerous integer sequences derived from two special balanced k-ary trees. The main contributions of this article are twofold. The first is building a taxonomy of various balanced trees. The other pertains to discovering new integer sequences and generalizing existing integer sequences to balanced k-ary trees. The generalized integer sequence formulae for the sum of he...
A New Parallelization Method for K-means
K-means is a popular clustering method used in data mining. To work with large datasets, researchers proposed PKMeans, a parallel k-means on MapReduce [3]. However, the existing k-means parallelization methods, including PKMeans, have many limitations. They cannot finish all their iterations in one MapReduce job, so they must repeat cascading MapReduce jobs in a loop until convergence. On...
MR-Tree - A Scalable MapReduce Algorithm for Building Decision Trees
Learning decision trees from very large amounts of data is not practical on single-node computers because of the large amount of computation the process requires. Apache Hadoop is a large-scale distributed computing platform that runs on commodity hardware clusters and can be used successfully for data mining tasks against very large datasets. This work presents a parallel decision tree learn...
ScalaGiST: Scalable Generalized Search Trees for MapReduce Systems [Innovative Systems Paper]
MapReduce has become the state-of-the-art for data parallel processing. Nevertheless, Hadoop, an open-source equivalent of MapReduce, has been noted to have sub-optimal performance in the database context since it is initially designed to operate on raw data without utilizing any type of indexes. To alleviate the problem, we present ScalaGiST – scalable generalized search tree that can be seaml...